Deep learning is a branch of machine learning and a part of artificial intelligence. It is loosely inspired by the networks of neurons that make up the human brain. In deep learning, features are not explicitly programmed: it is a class of machine learning that stacks many non-linear processing layers to perform feature extraction and transformation, with the output of each layer used as the input of the next.
Deep learning models are powerful enough to discover useful representations on their own, requiring relatively little feature engineering from programmers, and they scale well: deep learning algorithms are especially effective when we have large amounts of input and output data.
Since deep learning is based on machine learning, and machine learning itself is a form of artificial intelligence whose goal is to imitate human behavior, the idea behind deep learning is that such algorithms can loosely imitate the brain.
Deep learning is done with the help of neural networks, and the idea behind these is that their design is inspired by biological neurons, which are nothing but brain cells.
Convolutional Neural Networks (CNNs) are now very popular in the deep learning community. CNN models are used in many applications and domains, but they are especially useful for image and video tasks.
A CNN has filters (also called kernels). Using the convolution operation, kernels extract relevant information from the input. Let's try to understand the importance of filters by using images as input data. When you convolve an image with a filter, you get a feature map:
Though convolutional neural networks were developed to handle problems with image data, they can also work well with sequential inputs.
CNNs learn these filters on their own; we do not specify them by hand. The filters help extract the relevant features from the incoming data.
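To make the idea of a kernel concrete, here is a minimal sketch (the 6x6 "image" and the vertical edge-detection kernel below are made up for illustration) of how convolving an input with a filter produces a feature map:

```python
import numpy as np

# A made-up 6x6 grayscale "image": bright left half, dark right half.
image = np.zeros((6, 6))
image[:, :3] = 1.0

# A vertical edge-detection kernel (Sobel-like, simplified).
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])

def convolve2d(img, k):
    """Valid (no padding) 2D cross-correlation, as used in CNN layers."""
    kh, kw = k.shape
    oh, ow = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Slide the kernel over the image and sum the elementwise products.
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * k)
    return out

feature_map = convolve2d(image, kernel)
print(feature_map)  # non-zero only where the vertical edge is
```

The feature map responds strongly only at the boundary between the bright and dark halves, which is exactly the "relevant information" this particular filter extracts.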
A perceptron (or neuron) can be thought of as the basic processing unit. Each layer of an artificial neural network (ANN) consists of many such neurons. Since the input is processed only in the forward direction, an ANN is also called a feedforward neural network:
As you can see, an ANN has three layers: the input layer, the hidden layer, and the output layer. The input layer accepts the inputs, the hidden layer processes them, and the output layer produces the result. Essentially, each layer tries to learn certain weights.
ANN can be used to solve problems related to:
Tabular data
Image data
Text data
Neural networks can learn any non-linear function. For this reason, these networks are often called universal function approximators. Artificial neural networks can learn weights that map any input to the corresponding output.
The activation function is one of the main reasons for this universal approximation capability. Activation functions give the network its non-linear properties, which helps the network learn complex input-output relationships.
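For illustration, a minimal sketch of three common activation functions and the ranges they squash values into:

```python
import numpy as np

# Common activation functions; these give the network its non-linearity.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print("sigmoid:", sigmoid(x))  # squashes to (0, 1)
print("tanh:   ", tanh(x))     # squashes to (-1, 1)
print("relu:   ", relu(x))     # zeroes out negatives
```

Without such a non-linearity between layers, stacking linear layers would still only compute a linear function of the input.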
RNNs have a recurrent (looping) connection on the hidden state. This looping constraint ensures that the sequential information in the input data is captured.
We can use Recurrent Neural Networks to solve problems related to:
Time series data
Text data
Audio data
When making predictions, an RNN takes into account the order of the data found in the input, i.e. the order of the words in a text:
The parameters of RNNs are shared between time steps. This is commonly referred to as parameter sharing. As a result, there are fewer parameters to train, lowering the computational cost.
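Parameter sharing can be sketched in a few lines of numpy: the same weight matrices are applied at every time step, so the parameter count does not grow with sequence length (the sizes and random weights below are arbitrary, for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# One recurrent cell: the SAME W_xh, W_hh, b are reused at every time step.
hidden, n_in = 4, 3
W_xh = rng.normal(size=(hidden, n_in)) * 0.1
W_hh = rng.normal(size=(hidden, hidden)) * 0.1
b = np.zeros(hidden)

def rnn_forward(xs):
    h = np.zeros(hidden)
    for x in xs:                              # iterate over time steps in order
        h = np.tanh(W_xh @ x + W_hh @ h + b)  # shared parameters at each step
    return h

sequence = [rng.normal(size=n_in) for _ in range(5)]
h_final = rnn_forward(sequence)
print(h_final.shape)  # (4,) regardless of sequence length
```

Note that `rnn_forward` works for a sequence of any length with the same three parameter tensors, which is exactly the parameter sharing described above.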
A feedforward neural network (FNN) is a simple type of neural network in which information flows in a single direction, from input to output, through one or more layers in between. The output of each hidden layer is used as the input of the next layer. Each layer contains one or more sigmoid neurons, the building blocks of neural networks (the activation could also be another non-linear function such as tanh, ReLU, or leaky ReLU, but we will use the sigmoid function in this article). Simply put, an FNN uses many sigmoid neurons arranged in layers to learn the relationship between the input x and the output y, i.e. y = f(x). (The number of layers and the number of neurons per layer may vary; they are hyperparameters, and this article will not explain how to set them.)
LSTM can solve the long-term dependency problem. It does this by ignoring (forgetting) unnecessary data in the network. The LSTM drops information if the input does not provide anything useful, and when new information arrives, the network decides what to ignore and what to remember.
LSTM Architecture
Let's see the difference between RNN and LSTM.
In RNN we have a very simple model with only one activation function (tanh).
The LSTM adds several components that not only facilitate the operation of the network, but also give the network the ability to forget and remember information:
Cell state (Memory cell)
Forget gate
Input gate
Output gate
The cell state is the first part of the LSTM and runs through the entire LSTM unit. It can be thought of as a conveyor belt.
The cell state is responsible for remembering and forgetting, and this depends on the input: some previous information must be remembered, some must be forgotten, and some new information must be added to memory. The first operation (X) is pointwise multiplication, which scales each element of the cell state by a gate value between 0 and 1; data multiplied by 0 is forgotten by the LSTM. The second operation (+) is pointwise addition, which is responsible for adding new information to the cell state.
The forget gate of the LSTM, as the name suggests, decides which information should be forgotten. A sigmoid layer is used to determine this; it is called the "forget gate layer".
It looks at h(t-1) and x(t) and, with the help of the sigmoid layer, outputs a number between 0 and 1 for each element of the cell state C(t-1). An output of "1" means we keep the information completely; "0" means it is forgotten.
The input gate provides new data to the LSTM and decides whether to store the new data in the cell state.
This has 3 parts -
A sigmoid layer determines which values to update; this layer is called the "input gate layer". A tanh activation layer then creates a vector of new candidate values, C̃(t), that could be added to the state. We combine the two outputs, i(t) * C̃(t), and update the cell state: the new cell state C(t) is obtained by adding the output of the forget gate and the scaled candidate values.
The output of the LSTM unit depends on the new cell state.
First, a sigmoid layer decides what parts of the cell state we’re going to output. Then, a tanh layer is used on the cell state to squash the values between -1 and 1, which is finally multiplied by the sigmoid gate output.
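The gate computations above can be sketched for a single LSTM step in numpy (the sizes and random weights below are arbitrary, for illustration; real implementations use learned weights):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
n_h, n_x = 3, 2
concat = n_h + n_x  # gates see [h(t-1), x(t)] concatenated

# Small random weights, one matrix and bias per gate (illustrative only).
W_f = rng.normal(size=(n_h, concat)) * 0.1
W_i = rng.normal(size=(n_h, concat)) * 0.1
W_c = rng.normal(size=(n_h, concat)) * 0.1
W_o = rng.normal(size=(n_h, concat)) * 0.1
b_f = b_i = b_c = b_o = np.zeros(n_h)

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([h_prev, x])
    f = sigmoid(W_f @ z + b_f)        # forget gate: what to erase from memory
    i = sigmoid(W_i @ z + b_i)        # input gate: what new info to store
    c_tilde = np.tanh(W_c @ z + b_c)  # candidate values
    c = f * c_prev + i * c_tilde      # the (X) and (+) operations on the cell state
    o = sigmoid(W_o @ z + b_o)        # output gate
    h = o * np.tanh(c)                # new hidden state: squashed state times gate
    return h, c

h, c = lstm_step(rng.normal(size=n_x), np.zeros(n_h), np.zeros(n_h))
print(h.shape, c.shape)
```

Each line maps directly to one gate in the architecture: forget, input, candidate, cell-state update, and output.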
The perceptron is considered the first generation of artificial neural networks. It is modeled on the neurons that make up the human brain. It is used for binary classification, meaning it can assign an input to one of two groups.
import tensorflow as tf
# Define the model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(units=1, activation='sigmoid', input_shape=(2,))
])
# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# Define input data
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 0, 1]  # AND operation
# Convert to tensors
X = tf.constant(X, dtype=tf.float32)
y = tf.constant(y, dtype=tf.float32)
# Train the model
model.fit(X, y, epochs=1000, verbose=0)
# Evaluate the model
loss, accuracy = model.evaluate(X, y)
print(f'Loss: {loss}, Accuracy: {accuracy}')
# Perform predictions
predictions = model.predict(X)
print("Predictions:")
for i, input_data in enumerate(X):
    print(f"Input: {input_data.numpy()}, Prediction: {predictions[i][0]}")
1/1 [==============================] - 0s 350ms/step - loss: 0.5343 - accuracy: 0.5000
Loss: 0.5343436002731323, Accuracy: 0.5
1/1 [==============================] - 0s 206ms/step
Predictions:
Input: [0. 0.], Prediction: 0.28132081031799316
Input: [0. 1.], Prediction: 0.2191612720489502
Input: [1. 0.], Prediction: 0.5359167456626892
Input: [1. 1.], Prediction: 0.45295795798301697
Perceptron takes multiple features as input, which are combined with the help of a weighted sum and then passed through the activation function. The output from the activation function gets labeled as one of the two classes.
Perceptron may take many iterations through the data to learn important features of the two classes. The Perceptron model improves itself by adjusting the weights and biases which are the input to the neuron. When we train the model, the Perceptron is presented with a series of input-output pairs and the weights and bias are adjusted to minimize the error during its predictions. The process of adjusting the weights and bias to minimize the error is done by an algorithm known as Perceptron Learning Rule.
But why do we need weights and bias?
Weights and biases are important parameters that allow the model to learn the mapping between the input features and output predictions by minimizing the error between the predicted values and target values. Weights define the importance of each feature while bias shifts the output by a certain amount.
The principle behind Perceptron and how it works is really intuitive. Let’s see how it works mathematically:
We know that the equation of the linear model is given as:
f(x) = w.X + b
where, X = Input features
w = weights
b = bias
The perceptron trick, also known as the perceptron learning algorithm or perceptron update rule, is a simple algorithm used to train a binary classifier, specifically a single-layer perceptron. A perceptron is a type of artificial neuron that takes multiple binary inputs, applies weights to them, sums them up, and passes the result through an activation function to produce an output.
Here's a step-by-step explanation of the perceptron trick:
Initialize Weights and Bias:
Start with random weights (w) for each input feature and a bias term (b).
Calculate the Weighted Sum:
For a given input instance x=(x1,x2,…,xn), calculate the weighted sum z as follows:
z=w1⋅x1+w2⋅x2+…+wn⋅xn+b
Apply the Activation Function:
Pass the weighted sum through an activation function. In the case of a perceptron, the typical activation function is a step function (e.g., Heaviside step function), which outputs 1 if the weighted sum is greater than or equal to 0, and 0 otherwise.
Update Weights and Bias:
If the predicted output (y^) does not match the actual output (y), update the weights and bias using the perceptron learning rule:
wi←wi+α⋅(y−y^)⋅xi
b←b+α⋅(y−y^)
where α is the learning rate, y is the true output, and ŷ is the predicted output.
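The update rule above can be implemented from scratch in a few lines; here is a minimal sketch that trains a perceptron on the OR function (the learning rate and epoch count are arbitrary choices for illustration):

```python
import numpy as np

# Perceptron trained with the update rule above (alpha = learning rate).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 1])  # OR labels

w = np.zeros(2)
b = 0.0
alpha = 0.1

def predict(x):
    return 1 if w @ x + b >= 0 else 0   # Heaviside step activation

for epoch in range(20):
    for xi, yi in zip(X, y):
        y_hat = predict(xi)
        w = w + alpha * (yi - y_hat) * xi   # w_i <- w_i + alpha*(y - y^)*x_i
        b = b + alpha * (yi - y_hat)        # b   <- b   + alpha*(y - y^)

print([predict(xi) for xi in X])  # -> [0, 1, 1, 1]
```

Because OR is linearly separable, the rule converges after a handful of epochs; on a non-separable problem like XOR it would cycle forever.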
The perceptron loss function is often associated with the training of a perceptron, a type of artificial neuron used for binary classification. The goal of training a perceptron is to find a set of weights and a bias that allow the perceptron to correctly classify input instances into one of two classes (usually labeled as 0 or 1).
The perceptron loss function is a simple measure of the error between the predicted output and the true output for a given input. It is typically defined as follows:
L(y, ŷ) = max(0, −y · ŷ)
Here:
L is the loss function.
y is the true output label (either -1 or 1).
ŷ is the predicted output.
The perceptron learning algorithm aims to minimize this loss function by adjusting the weights and bias during training. The weight update rule is derived from the gradient of this loss function with respect to the weights.
If ŷ and y have the same sign (correctly classified), the loss is 0. If they have different signs, meaning a misclassification has occurred, the loss is proportional to the distance of the predicted output from the correct output.
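The behavior just described can be checked directly with a small sketch of the loss:

```python
import numpy as np

def perceptron_loss(y, y_hat):
    # L(y, y^) = max(0, -y * y^), with true labels in {-1, +1}
    return np.maximum(0.0, -y * y_hat)

print(perceptron_loss(1, 0.8))    # same sign  -> 0.0
print(perceptron_loss(-1, 0.8))   # wrong sign -> 0.8
print(perceptron_loss(1, -2.0))   # wrong sign, further away -> 2.0
```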
The perceptron learning rule involves adjusting the weights and bias in the direction that reduces the loss. The weight update rule for the i-th weight is given by:
wi←wi+α⋅(y−y^)⋅xi
wi is the i-th weight.
α is the learning rate.
xi is the i-th input feature.
Linear Separability Requirement:
The perceptron algorithm works well only when the data is linearly separable, meaning it can be divided into two classes by a hyperplane. If the data is not linearly separable, the perceptron may not converge or may take a long time to converge.
Inability to Learn Complex Patterns:
Perceptrons are limited in their ability to learn complex patterns or relationships in data. They can only represent linear decision boundaries, which is a significant constraint for tasks that require capturing nonlinear relationships.
Sensitivity to Outliers:
Perceptrons are sensitive to outliers in the training data. A single mislabeled instance can significantly affect the learning process and the resulting decision boundary.
Lack of Probabilistic Outputs:
Perceptrons do not naturally provide probabilistic outputs. In classification tasks, having probability estimates can be valuable, especially in applications where uncertainty is a critical factor.
Binary Classification Only:
The perceptron is designed for binary classification tasks and cannot directly handle multi-class classification problems. Extensions, such as the one-vs-all approach, can be used, but they may have their own limitations.
Difficulty in Handling Continuous Inputs:
Traditional perceptrons are designed for binary inputs, and handling continuous inputs may require additional preprocessing or the use of alternative architectures, such as the multi-layer perceptron.
Vanishing Gradient Problem:
In the context of deep learning, the vanishing gradient problem can limit the training of deep architectures. Perceptrons are a building block for neural networks, and in deep networks, gradients can become very small during backpropagation, making it challenging to update weights in the early layers.
Fixed Learning Rate:
The perceptron learning algorithm uses a fixed learning rate. While learning rate schedules can be used to address this to some extent, finding an appropriate learning rate can be a non-trivial task.
Noisy Data Sensitivity:
Perceptrons are sensitive to noise in the training data. Noisy or mislabeled instances can lead to suboptimal or incorrect models.
# Standard scientific Python imports
import matplotlib.pyplot as plt
# Import datasets, classifiers, and performance metrics
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
# The digits dataset
digits = datasets.load_digits()
# The data that we are interested in is made of 8x8 images of digits
_, axes = plt.subplots(2, 4)
images_and_labels = list(zip(digits.images, digits.target))
for ax, (image, label) in zip(axes[0, :], images_and_labels[:4]):
    ax.set_axis_off()
    ax.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    ax.set_title('Training: %i' % label)
# To apply a classifier on this data, we need to flatten the images
n_samples = len(digits.images)
data = digits.images.reshape((n_samples, -1))
# User input for hidden layer size
hidden_layer_size = int(input("Enter the size of the hidden layer: "))
# Create a classifier: a simple neural network (MLP)
classifier = MLPClassifier(hidden_layer_sizes=(hidden_layer_size,), max_iter=1000)
# Split data into train and test subsets
X_train, X_test, y_train, y_test = train_test_split(
    data, digits.target, test_size=0.5, shuffle=False)
# Standardize data
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
# We learn the digits on the first half of the digits
classifier.fit(X_train, y_train)
# Now predict the value of the digit on the second half
predicted = classifier.predict(X_test)
images_and_predictions = list(zip(digits.images[n_samples // 2:], predicted))
for ax, (image, prediction) in zip(axes[1, :], images_and_predictions[:4]):
    ax.set_axis_off()
    ax.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    ax.set_title('Prediction: %i' % prediction)
print("Classification report for classifier %s:\n%s\n"
      % (classifier, metrics.classification_report(y_test, predicted)))
# Note: plot_confusion_matrix was removed in scikit-learn 1.2;
# ConfusionMatrixDisplay.from_estimator is its replacement
disp = metrics.ConfusionMatrixDisplay.from_estimator(classifier, X_test, y_test)
disp.figure_.suptitle("Confusion Matrix")
print("Confusion matrix:\n%s" % disp.confusion_matrix)
plt.show()
Enter the size of the hidden layer: 100
Classification report for classifier MLPClassifier(max_iter=1000):
precision recall f1-score support
0 0.99 0.98 0.98 88
1 0.98 0.87 0.92 91
2 0.95 0.97 0.96 86
3 0.95 0.82 0.88 91
4 0.98 0.92 0.95 92
5 0.93 0.96 0.94 91
6 0.94 0.99 0.96 91
7 0.92 0.94 0.93 89
8 0.91 0.94 0.93 88
9 0.82 0.95 0.88 92
accuracy 0.93 899
macro avg 0.94 0.93 0.93 899
weighted avg 0.94 0.93 0.93 899
Confusion matrix:
[[86 0 1 0 0 0 1 0 0 0]
[ 0 79 1 1 0 0 0 0 1 9]
[ 0 0 83 2 0 0 1 0 0 0]
[ 0 0 2 75 0 4 0 4 5 1]
[ 1 0 0 0 85 0 0 1 0 5]
[ 0 0 0 0 0 87 2 0 0 2]
[ 0 1 0 0 0 0 90 0 0 0]
[ 0 0 0 1 2 0 0 84 1 1]
[ 0 1 0 0 0 1 1 1 83 1]
[ 0 0 0 0 0 2 1 1 1 87]]
# Below is the code for Method 2
import tkinter as tk
from tkinter import Canvas, Button
from PIL import Image, ImageDraw
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
# Load the digits dataset
digits = load_digits()
data = digits.images.reshape((len(digits.images), -1))
# Standardize the data
scaler = StandardScaler().fit(data)
data = scaler.transform(data)
# Create a classifier: a simple neural network (MLP)
classifier = MLPClassifier(hidden_layer_sizes=(64,), max_iter=1000)
classifier.fit(data, digits.target)
class DigitRecognizerApp:
    def __init__(self, root):
        self.root = root
        self.root.title("Digit Recognizer")
        self.canvas = Canvas(root, width=200, height=200, bg="white")
        self.canvas.pack()
        self.label = tk.Label(root, text="Draw a digit and click Predict")
        self.label.pack()
        self.predict_button = Button(root, text="Predict", command=self.predict_digit)
        self.predict_button.pack()
        self.clear_button = Button(root, text="Clear", command=self.clear_canvas)
        self.clear_button.pack()
        self.image = Image.new("L", (200, 200), color="white")
        self.draw = ImageDraw.Draw(self.image)
        self.canvas.bind("<B1-Motion>", self.paint)

    def paint(self, event):
        x1, y1 = (event.x - 1), (event.y - 1)
        x2, y2 = (event.x + 1), (event.y + 1)
        self.canvas.create_oval(x1, y1, x2, y2, fill="black", width=8)
        self.draw.line([x1, y1, x2, y2], fill="black", width=8)

    def predict_digit(self):
        # Resize and preprocess the drawn image
        img = self.image.resize((8, 8))
        img = np.array(img)
        # Check if the image is grayscale (2D) or has an additional channel (3D)
        if len(img.shape) == 3:
            img = img[:, :, 0]  # Convert to grayscale if necessary
        # Invert (the canvas is black-on-white, the dataset uses high values
        # for ink) and rescale to the 0-16 range of the digits dataset
        img = (255 - img) / 255.0 * 16
        img = img.reshape(1, -1)
        # Standardize the input data
        img = scaler.transform(img)
        # Predict the digit
        prediction = classifier.predict(img)
        self.label.config(text=f"Prediction: {prediction[0]}")

    def clear_canvas(self):
        self.canvas.delete("all")
        self.image = Image.new("L", (200, 200), color="white")
        self.draw = ImageDraw.Draw(self.image)
        self.label.config(text="Draw a digit and click Predict")

if __name__ == "__main__":
    root = tk.Tk()
    app = DigitRecognizerApp(root)
    root.mainloop()
Multi-Layer Perceptrons
The only problem with a single Perceptron layer is that it cannot capture the nonlinearity in a dataset and therefore cannot give good results on nonlinear data. This problem can be solved by the multi-layer perceptron, which performs very well on non-linear data.
A fully connected neural network (FCNN) is a neural network whose architecture is such that every node (neuron) in one layer is connected to every neuron in the next layer.
A multilayer perceptron is a network of input and output layers that may have one or more hidden layers between them, each containing many neurons. Instead of a hard threshold, the neurons in a multilayer perceptron can use nonlinear activation functions such as ReLU or sigmoid.
From the figure above, we can see that a fully connected multilayer perceptron consists of an input layer, two hidden layers and an output layer. Increasing the number of hidden layers and nodes in the layers will help capture the nonlinear behavior of the dataset and provide reliable results.
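As a concrete illustration of why hidden layers matter, here is a hand-built sketch of a tiny MLP that computes XOR, a function no single perceptron can represent (the weights below are hand-picked for illustration, not learned):

```python
import numpy as np

# Hand-picked weights showing how ONE hidden layer lets a
# network compute XOR, which no single perceptron can.
def step(z):
    return (z >= 0).astype(int)

W1 = np.array([[1.0, 1.0],    # column 0: hidden unit ~ OR(x1, x2)
               [1.0, 1.0]])   # column 1: hidden unit ~ AND(x1, x2)
b1 = np.array([-0.5, -1.5])
W2 = np.array([1.0, -2.0])    # output: OR and NOT-AND  ->  XOR
b2 = -0.5

def mlp_predict(x):
    h = step(x @ W1 + b1)                          # hidden layer
    return int(step(np.array([h @ W2 + b2]))[0])   # output layer

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
print([mlp_predict(x) for x in X])  # -> [0, 1, 1, 0]
```

The hidden units carve out two linear boundaries, and the output layer combines them into the nonlinear XOR boundary; training would normally discover such weights automatically.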
MLP Notations
The hardest thing to understand when working with neural networks is the backpropagation algorithm, which we use to train them by iteratively adjusting the weights and biases to minimize the loss.
When training a neural network, there are many weights and biases in the network. Since backpropagation is the process of adjusting these weights and biases, the notation for each weight and bias is key to understanding backpropagation.
The figure above shows a multilayer neural network consisting of an input layer, a hidden layer and an output layer.
Notation: W^h_ij
Where,
i = the node in the previous layer that the weight comes from.
j = the node that the weight goes to.
h = the layer that the weight feeds into.
For example:
W^1_11 = the weight from node 1 of the previous layer to node 1 of the 1st hidden layer.
W^1_23 = the weight from node 2 of the previous layer to node 3 of the 1st hidden layer.
W^2_45 = the weight from node 4 of the previous layer to node 5 of the 2nd hidden layer.
Notation: b_ij
Where,
i = the layer to which the bias belongs.
j = the node to which the bias belongs.
Example:
b_11 = the bias of node 1 of the 1st hidden layer.
b_23 = the bias of node 3 of the 2nd hidden layer.
b_41 = the bias of node 1 of the 4th hidden layer.
Notation: a_ij
Where,
i = the layer to which the output belongs.
j = the node to which the output belongs.
Example:
a_11 = the output of node 1 of the 1st hidden layer.
a_45 = the output of node 5 of the 4th hidden layer.
a_34 = the output of node 4 of the 3rd hidden layer.
MLP Intuition
Input layer:
Intuition: The nodes of the input layer are the features of the input data. Each node corresponds to a specific feature, and the values of these nodes are the raw material of the network.
Hidden layers:
Intuition: The hidden layers are where the network learns patterns and representations from the input data. Each hidden layer computes a weighted sum of its inputs (from the previous layer) and passes the result through an activation function. The weights and biases associated with these connections are adjusted during training.
Activation function:
Intuition: The activation function introduces nonlinearity into the network. This nonlinearity allows the MLP to learn and represent complex relationships in the data. Common functions include sigmoid, hyperbolic tangent (tanh), and rectified linear unit (ReLU).
Number of hidden layers and neurons:
Intuition: The depth (number of hidden layers) and width (number of neurons per hidden layer) affect the model's ability to learn complex patterns. Deeper networks can capture hierarchical features, while wider networks can capture more patterns in parallel.
Output layer:
Intuition: The output layer produces the final prediction or classification. The number of nodes in the output layer depends on the task (such as binary classification, multi-class classification, or regression). The activation function of the output layer is chosen according to the task (e.g., sigmoid for binary classification, softmax for multi-class).
Weights and biases:
Intuition: Weights and biases are the parameters the network learns during training. They determine the strength of the connections between neurons, and the training process adjusts them to minimize the difference between the predictions and the actual outputs.
Forward propagation:
Intuition: During forward propagation, the input data is passed layer by layer through the network. Each layer applies a linear transformation followed by a nonlinearity, and the final layer produces the output estimate.
Loss function:
Intuition: The loss function measures the difference between the predicted output and the actual label. The aim during training is to reduce this loss. Common loss functions include mean squared error for regression and cross-entropy for classification.
Backpropagation:
Intuition: Backpropagation is the process of adjusting the weights and biases according to the gradient of the loss with respect to those parameters. It allows the network to learn from its mistakes and improve its predictions.
Forward Propagation:
Purpose: This is the stage where the input data passes through the neural network and produces an output. The data travels through the network layer by layer; each layer computes a weighted sum of its inputs and then applies an activation function.
Steps:
Input data is fed into the input layer.
The inputs are multiplied by the weights and summed.
The bias is added to the weighted sum.
The result is passed through an activation function to introduce nonlinearity.
The output of each layer becomes the input of the next layer until the final output is produced.
Backward Propagation:
Purpose: This is the phase where the neural network learns by updating its weights based on the error between the predicted output and the actual target. The goal is to minimize this error.
Steps:
Compute Loss: Calculate the error between the predicted output and the actual target using a loss function.
Backward Pass: Compute the gradient of the loss with respect to the weights of the network.
Update Weights: Adjust the weights in the opposite direction of the gradient, using an optimization algorithm (e.g., gradient descent), to minimize the loss.
Iterate: Repeat the process for a specified number of iterations (epochs) or until a convergence criterion is met.
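The two phases can be sketched end to end in numpy on a tiny network (here trained on the OR function with sigmoid activations and mean squared error; the layer sizes, learning rate, and epoch count are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# Tiny 2-4-1 network trained on OR with explicit
# forward and backward passes (sigmoid activations, MSE loss).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [1]], dtype=float)  # OR labels

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

W1 = rng.normal(size=(2, 4)); b1 = np.zeros(4)
W2 = rng.normal(size=(4, 1)); b2 = np.zeros(1)
lr = 0.5

for epoch in range(5000):
    # ---- forward propagation: weighted sum, add bias, activate ----
    a1 = sigmoid(X @ W1 + b1)              # hidden layer
    a2 = sigmoid(a1 @ W2 + b2)             # output layer
    loss = np.mean((a2 - y) ** 2)
    # ---- backward propagation: chain rule, then gradient step ----
    d2 = (a2 - y) * a2 * (1 - a2)          # error signal at the output
    d1 = (d2 @ W2.T) * a1 * (1 - a1)       # error signal at the hidden layer
    W2 -= lr * a1.T @ d2; b2 -= lr * d2.sum(axis=0)
    W1 -= lr * X.T @ d1;  b1 -= lr * d1.sum(axis=0)

print(f"final loss: {loss:.4f}")
```

Each iteration runs a full forward pass, measures the loss, backpropagates the error signals with the chain rule, and steps every weight against its gradient.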
In this context, memorization refers to the neural network's ability to fit the training data exactly; that is, it learns the patterns in the data by heart, including the noise and outliers.
Here are some important points about memorization in MLPs:
Overfitting: Memorization is closely related to overfitting. Overfitting occurs when the model learns the training data so well that it captures noise or outliers that do not represent the underlying patterns. As a result, the model's performance suffers when it is applied to new, unseen data.
Capacity of neural networks: MLPs are universal function approximators and have the capacity to memorize relationships in data. The number of parameters and the architecture of the network contribute to this capacity.
The role of large training datasets: Neural networks memorize specific examples more easily when the dataset is small. It is therefore essential to have a sufficiently large and diverse dataset for training good models.
Regularization techniques: Various regularization techniques can be used to reduce memorization and overfitting. Methods include dropout, weight penalties (L1 or L2 regularization), and early stopping.
Data augmentation: Expanding the training data through techniques such as data augmentation can help prevent the memorization of specific examples by exposing the model to many variations.
Validation and test sets: It is important to evaluate the model's performance on a separate validation set and, finally, a test set. If the model performs well on the training set but poorly on this independent data, it has likely memorized the training data.
Managing memorization is important for building models that generalize well to unseen data. Keeping the model no more complex than necessary, regularizing it, and ensuring the diversity and representativeness of the training data are key to addressing this problem in MLPs.
Simply put, gradient descent is an iterative optimization algorithm that minimizes a cost function (the prediction error). It does this by repeatedly moving the parameters in the direction opposite to the gradient, towards the set of values that minimizes the function.
Let's review each step of the gradient descent process:
Repeat until convergence:
1. Compute the gradient of the cost function at the current parameter values.
2. Update the parameters by a step in the opposite direction of the gradient.
3. Redo the steps with the new values.
Gradient Descent Variants
Depending on how much of the data is used to compute the gradient at each update, there are three types of gradient descent:
Batch Gradient Descent
Stochastic Gradient Descent
Mini-batch Gradient Descent
Batch Gradient Descent -
Batch gradient descent, also known as vanilla gradient descent, calculates the error for each observation in the dataset but performs an update only after all observations have been evaluated. Batch gradient descent is computationally expensive, as the entire dataset must be kept in memory.
Stochastic Gradient Descent -
Stochastic gradient descent (SGD) updates the parameters for each observation. Therefore, it needs only one observation to perform an update rather than looping through all of them. SGD is generally faster than batch gradient descent, but because of the constant updates the error rate has a higher variance, sometimes jumping up instead of steadily decreasing.
Mini-batch Gradient Descent -
This combines the ideas of stochastic and batch gradient descent: the parameters are updated after each small batch of observations. Batch sizes between 50 and 256 are typical, and this method is the preferred choice for neural networks.
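The three variants differ only in the batch size used per update, which a single training loop can demonstrate on a toy linear regression (the data, learning rate, and epoch count below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear-regression data: y = 2x + 1 plus noise.
X = rng.uniform(-1, 1, size=200)
y = 2 * X + 1 + rng.normal(scale=0.1, size=200)

def grad(w, b, xb, yb):
    """Gradient of MSE on the batch (xb, yb)."""
    err = (w * xb + b) - yb
    return 2 * np.mean(err * xb), 2 * np.mean(err)

def train(batch_size, lr=0.1, epochs=200):
    w = b = 0.0
    n = len(X)
    for _ in range(epochs):
        idx = rng.permutation(n)               # shuffle each epoch
        for s in range(0, n, batch_size):
            j = idx[s:s + batch_size]
            gw, gb = grad(w, b, X[j], y[j])
            w -= lr * gw; b -= lr * gb
    return w, b

print("batch     :", train(batch_size=200))  # full dataset per update
print("SGD       :", train(batch_size=1))    # one sample per update
print("mini-batch:", train(batch_size=32))   # the usual compromise for NNs
```

All three recover parameters near (2, 1); SGD takes many more, noisier steps per epoch, while batch takes one large, stable step.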
I want to find the function relating x and y, so that it describes the relationship between these two variables.
The function is: y = x * 2
Again, how about this one?
The function is: y = x * 3
Again, how about this one?
The function is: y = x * 2 + 3
Again, how about this one?
Well, this one might not be that easy. Correct!
This is where gradient descent comes in. Using this technique, we can find the linear equation for this particular problem: the function is y = x * 0.5 - 1.5.
y = x * 0.5 - 1.5 is also called the prediction function: if we change x in the future, y will change accordingly. That is the main point of supervised machine learning and of deep learning.
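To show gradient descent actually recovering y = x * 0.5 - 1.5, here is a minimal sketch (the x values, learning rate, and iteration count are arbitrary choices for illustration):

```python
import numpy as np

# Data generated from the target function y = x * 0.5 - 1.5;
# gradient descent recovers the slope and intercept from the data alone.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = x * 0.5 - 1.5

w, b = 0.0, 0.0
lr = 0.05
for _ in range(5000):
    y_pred = w * x + b
    dw = np.mean(2 * (y_pred - y) * x)   # d(MSE)/dw
    db = np.mean(2 * (y_pred - y))       # d(MSE)/db
    w -= lr * dw
    b -= lr * db

print(round(w, 3), round(b, 3))  # close to 0.5 and -1.5
```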
Based on age and affordability, we want to come up with a function that can predict whether a person will buy insurance or not.
Here, age and affordability are x, and have_insurance is y, i.e. y = f(x); we are trying to figure out this function f(x).
See here -
Here we have a simple neural network with a single neuron. There are two input nodes, but when counting neurons we usually do not include the input nodes, so this network has only a single neuron, and it computes a logistic function. Logistic regression is a very simple special case of a neural network with a single neuron, and it has two components: a weighted sum and a sigmoid function.
When we train a neural network, remember that gradient descent is used during training. From the dataset above we have 13 data points, and we will use all of them to train the network. So, I will feed in the first sample.
I will initialize the weights to some random value (here I use 1) and initialize the bias to zero, and then feed the first sample into the network; this is called a forward pass.
For this sample the predicted value is 0.99 while the true value is 0. Here ŷ is the predicted value; given the predicted value and the truth value, we can find the error.
So, the error is -
For logistic regression we use log loss; we do not use mean squared error or mean absolute error. This is just the mathematical equation for log loss. The error for the first sample comes out to about 4.6. We repeat this same process for all the samples in the dataset.
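To check the 4.6 figure, here is the log-loss equation in code (the clipping epsilon is a standard safeguard, my own addition):

```python
import math

def log_loss(y_true, y_pred, eps=1e-15):
    # Clip the prediction so log(0) can never occur
    y_pred = min(max(y_pred, eps), 1 - eps)
    return -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1 - y_pred))

# First sample from the text: predicted 0.99, truth 0
print(round(log_loss(0, 0.99), 1))  # → 4.6
```

A confident wrong prediction (0.99 when the truth is 0) is punished heavily, which is exactly why log loss suits classification.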
Then we calculate the Loss -
After the first epoch (an epoch is one pass through all the training samples, so once we have gone through samples 1 to 13 we have completed one epoch), the loss is 4.31. Our goal is now to backpropagate the loss so we can adjust weight1 (w1) and weight2 (w2). Remember: whenever we train a neural network we train for multiple epochs, until we arrive at good values of w1 and w2. The aim is to adjust w1 and w2 so that the log loss drops below 4.31. We can do this with w1 = w1 - adjustment, i.e. by subtracting (or adding) some value to w1. So what is the adjustment? We compute it using a derivative. Like this -
The learning rate (people usually assume about 0.01) simply scales the derivative, limiting the size of each step. The derivative tells us how the loss changes for a given change in w1. We then find the log-loss derivative using this formula:
For bias -
We are actually using batch gradient descent here. When should we stop adjusting the weights? Because the error function is convex (shaped like a bowl), we stop when we reach the global minimum.
We take the derivative, which is the tangent shown here, and keep moving the point towards the global minimum. Let's move into the code.
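Since the 13-point dataset itself is not reproduced here, the sketch below trains the single neuron on a hypothetical stand-in dataset of the same shape (age scaled to 0-1, affordability as 0/1). The weights start at 1 and the bias at 0, as in the text; the learning rate and epoch count are my own choices:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def log_loss(y_true, y_pred, eps=1e-15):
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# Hypothetical 13-sample insurance dataset (stand-in for the one in the text)
age    = np.array([0.22, 0.25, 0.47, 0.52, 0.46, 0.56, 0.55, 0.60, 0.62, 0.61, 0.18, 0.28, 0.27])
afford = np.array([1,    0,    1,    0,    1,    1,    0,    1,    1,    1,    1,    1,    0   ])
have_insurance = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0])

w1, w2, bias = 1.0, 1.0, 0.0   # weights start at 1, bias at 0, as above
lr = 0.5

for epoch in range(5000):
    # Forward pass: weighted sum, then sigmoid
    y_pred = sigmoid(w1 * age + w2 * afford + bias)
    # Backward pass: log-loss gradients, batch gradient descent over all 13 samples
    w1   -= lr * np.mean((y_pred - have_insurance) * age)
    w2   -= lr * np.mean((y_pred - have_insurance) * afford)
    bias -= lr * np.mean(y_pred - have_insurance)

final_loss = log_loss(have_insurance, sigmoid(w1 * age + w2 * afford + bias))
print(final_loss)  # falls well below the initial loss as the epochs go by
```

Each pass over all 13 samples is one epoch, and the loss should keep shrinking epoch after epoch, exactly as described above.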
Gradients can vanish when training neural network models with multiple layers. Adding more layers to the network causes the gradients flowing back to the early layers to shrink towards zero, so the weight updates there become negligible. We say the weights stay nearly constant because the change is insignificant.
Let's understand this by examining the problem in depth.
What is the Vanishing Gradient Problem
The vanishing gradient problem refers to the decrease of the backpropagated error signal (often exponential) with increasing distance from the final layer (typically the output layer).
But what does this mean? Let us understand this by creating a small network of our own.
This is a simple neural network, with one input layer, one hidden layer, and the output layer. The weights are written alongside the nodes in the diagram.
The model training process goes on in the following manner:
The inputs are multiplied with their respective weights, bias is added, and the first hidden layer is reached.
This hidden layer again performs the computations according to their specific weights, and the signal is propagated to the next layer (in this case, the output layer).
The output (O21) is fed to the loss function, and the optimizer uses the loss to adjust the values of the weights, decreasing the loss through the process of backpropagation.
The formula used to change the weight is as follows:
Wa(new) = Wa(old) - α * (δL/δWa)
where α is the learning rate, L is the loss function, and Wa is the weight associated with one of the features.
Now let's try to calculate the new value of W1 after the first step.
W1(new) = W1(old) - α * ( dL/dW1)
How do we calculate (dL/dW1)?
We can see that the value flowing through W6 depends on the values of W4 and W5.
The values flowing through W4 (and W5) depend in turn on the values of W1, W2 and W3.
So, to calculate the change in the loss due to W1, we must take into account the changes due to W6 and W4 (because they connect W1 to the output).
So
(dL/dW1) = (dW6/dW4) * (dW4/dW1)
This is known as the chain rule of differentiation.
As we go down the layers, the derivatives decrease, so the value of
(dW6/dW4) is greater than (dW4/dW1).
The vanishing gradient problem typically occurs when the activation function is the sigmoid. But why?
This is because the value of the derivative of a sigmoid function always lies between 0 and 0.25.
You can plot the sigmoid function and its derivative using Matplotlib.
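Such a plot might look like the following sketch (NumPy and Matplotlib assumed available); note the derivative peaks at exactly 0.25 when z = 0:

```python
import numpy as np
import matplotlib.pyplot as plt

z = np.linspace(-10, 10, 401)          # 401 points so z = 0 is included
sigmoid = 1 / (1 + np.exp(-z))
derivative = sigmoid * (1 - sigmoid)   # sigma'(z) = sigma(z) * (1 - sigma(z))

print(derivative.max())  # → 0.25, the largest value the derivative can take

plt.plot(z, sigmoid, label="sigmoid")
plt.plot(z, derivative, label="derivative")
plt.legend()
plt.show()
```

The derivative curve stays entirely between 0 and 0.25, which is the root cause of the shrinking gradients.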
Therefore, if the (dW6/dW4) value is 0.1 and the (dW4/dW1) value is 0.001, we can clearly see that the (dL/dW1) value is 0.0001.
Assume W1(old) = 2.5, α = 1, then
W1(new) = 2.5 - 0.0001 = 2.4999
You can see this clearly when we scale the network up: as the number of layers increases, the (dL/dW1) value decreases further, and so the weight barely changes.
This is called the vanishing gradient problem.
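The exponential shrinkage can be demonstrated numerically. By the chain rule, the gradient reaching an early layer contains one sigmoid-derivative factor per layer, and each factor is at most 0.25 (the 10-layer depth below is an illustrative assumption):

```python
import math

def sigmoid_derivative(z):
    s = 1 / (1 + math.exp(-z))
    return s * (1 - s)

# Best case for the gradient: every layer sits at z = 0, where the
# sigmoid derivative takes its maximum value of 0.25.
gradient = 1.0
for layer in range(10):
    gradient *= sigmoid_derivative(0.0)

print(gradient)  # → 9.5367431640625e-07, i.e. 0.25 ** 10
```

Even in this best case, ten layers shrink the gradient by a factor of about a million, so the early weights receive essentially no update.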